Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server
In the last decade, data analytics has rapidly progressed from traditional
disk-based processing to modern in-memory processing. However, little effort
has been devoted to enhancing performance at the micro-architecture level. This
paper characterizes the performance of in-memory data analytics using the Apache
Spark framework. We use a single-node NUMA machine and identify the bottlenecks
hampering the scalability of workloads. We also quantify the inefficiencies at
the micro-architecture level for various data analysis workloads. Through empirical
evaluation, we show that Spark workloads do not scale linearly beyond twelve
threads, due to work time inflation and thread-level load imbalance. Further,
at the micro-architecture level, we observe memory-bound latency to be the
major cause of work time inflation.
Comment: Accepted to the 5th IEEE International Conference on Big Data and
Cloud Computing (BDCloud 2015)
Code Generation and Run-time Support For Multi-Level Parallelism . . .
In this paper we describe the main components of the NanosCompiler, an OpenMP compiler whose implementation is oriented towards the efficient exploitation of nested parallelism. Program parallelization relies both on the automatic parallelization capabilities of the base compiler and on the information obtained from user-supplied directives. The compiler uses a hierarchical internal representation that unifies both sources of parallelism, proceeds with a task identification phase that adapts the granularity of the final tasks to the target architecture, and then generates parallel code. The paper also presents an analysis of the special support needed at the threads library level for this kind of parallelism. These requirements are analyzed in our current implementation, named NthLib.
Towards an efficient exploitation of loop-level parallelism in Java
This paper analyzes the overheads incurred in the exploitation of loop-level parallelism using Java Threads and proposes some code transformations that minimize them. Avoiding the intensive use of Java Threads and reducing the number of classes used to specify the parallelism in the application result in promising performance gains that may encourage the use of Java for exploiting loop-level parallelism. On average, the execution time of our synthetic benchmarks is reduced by 50% with the simplest transformation when 8 threads are used.
Task-based Parallel Breadth-First Search in Heterogeneous Environments
Breadth-first search (BFS) is an essential graph traversal strategy widely used in many computing applications. Because of its irregular data access patterns, BFS has become a non-trivial problem that is hard to parallelize efficiently. In this paper, we introduce a parallelization strategy that allows the load balancing of computation resources as well as the execution of graph traversals in hybrid environments composed of CPUs and GPUs. To achieve that goal, we use a fine-grained task-based parallelization scheme and the OmpSs programming model. We obtain processing rates of up to 2.8 billion traversed edges per second with a single GPU and a multi-core processor. Our study shows that high processing rates are achievable in hybrid environments despite GPU communication latency and memory coherence overheads.
Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications
In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms, and discuss OpenMP implementation issues that affect the performance of multi-level parallel applications.